Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed in response to feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and also distributed via email. The teaching materials will consist of a JupyterLab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).
We'll take a blank-slate approach here to Python and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to this third lecture in a series of seven. Today we will pick up where we left off last week with DataFrames. Now that we know some of the basics regarding this data structure, we can drill into using them to properly format your data for further analyses.
At the end of this lecture we will aim to have covered the following topics:
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today we will work with the following datasets:
An Excel book of five sheets, where each sheet is a dataset. We will use this dataset to demonstrate how to import Excel files into Python using pandas.
This is a dataset resulting from Phase 1 of the Human Microbiome Project (HMP). The HMP was a joint effort led by the US National Institutes of Health (NIH) to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (a.k.a. environmental DNA) and 16S ribosomal RNA gene amplicon sequencing. The dataset that we will use contains the results of 16S rRNA gene sequencing, a technique in which this gene is used as a marker for taxonomic identification and classification.
In very general terms, microbial DNA is extracted from a sample, the 16S rRNA gene is amplified using the polymerase chain reaction (PCR), the amplicons are sequenced, and the resulting sequences are matched against a reference database. The results are presented as counts of Operational Taxonomic Units (OTUs), which for data interpretation are the equivalent of microbial species.
There are two datasets located in the data folder which we're interested in:
human_microbiome_project_measurements.csv
human_microbiome_project_otu_taxa_metadata.csv
These files, however, are excessively large - especially the metadata file. Therefore we'll be using subsets of those files called:
human_microbiome_project_measurements_subset.csv
human_microbiome_project_otu_taxa_metadata_subset.csv
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple outputs per code cell.
random is a package with methods to add pseudorandomness to programs
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
os
pandas
matplotlib
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
#!pip install openpyxl
Wide and long (sometimes called un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.
In a long-formatted dataset, each column is a variable and the results of each measured variable are stored in rows (observations). In contrast, not every column in a wide-formatted dataset is necessarily a variable, so you can have several observations of the same type of variable in the same row. The names long and wide come from the general shape that the two data formats have.
For data science applications, long format is preferred over wide format because it allows for easier and more efficient computations, data subsetting and manipulation. Wide format is more friendly to the human eye and easier to work with when data needs to be manually recorded/input. Therefore, having the ability to interconvert between these two data formats is a valuable and required skill. The following is a general scheme of wide- (left) and long-format (right) datasets:
| While more readable and technically more compact, the wide data format is not easily parsed for data analysis compared to the long data format. |
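As a concrete illustration, the wide-to-long reshaping can be sketched with pandas' `melt()` function on a small hypothetical table (the subjects and body-site columns here are invented for the example, not taken from the HMP files):

```python
import pandas as pd

# A toy wide-format table: one row per subject, one column per body site
wide = pd.DataFrame({
    "subject": ["S1", "S2"],
    "stool": [120, 85],
    "saliva": [40, 22],
})

# melt() gathers the site columns into variable/value pairs (long format):
# every measurement now lives in its own row
long = wide.melt(id_vars="subject", var_name="body_site", value_name="count")
print(long)
```

Each of the four measurements now occupies its own row, which is the shape that grouping, plotting and modelling tools expect.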
Why tidy data?
Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure, and tools that work with that data structure, can help this process along.
Tidy data has:
This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.
The 5 most common problems with messy datasets are:
Fortunately there are some tools available to solve these problems.
In our case, we have two datasets ready in our data folder to import for this lecture:
hmp_data, made from importing data/human_microbiome_project_measurements.csv
A comma-separated file with information about the OTUs (Operational Taxonomic Units), such as the number of times an OTU was detected (count), the part of the body where the sample came from, the subject's sex, etc. In total there are 2,898 rows and 43,149 columns.
hmp_taxa_metadata, made from importing data/human_microbiome_project_otu_taxa_metadata.csv
A dataset with taxonomic information for each OTU: seven rows and 43,141 columns. This could be considered a metadata table, as it does not pertain directly to measured data but rather to information about the variables or categories in our data.
| More than 50% of your time may be spent dealing with the shape and format of your data. Once that is figured out, the analysis can seem rather simple! |
We'll begin by learning how to import different data formats. Next we will use pandas to introduce concepts of data wrangling to convert our datasets from wide to long format. When we are ready, we will merge the two files into a single dataset that we can use for exploratory data analysis.
These datasets are part of the R package HMP16SData. Details on how to prepare the data for download can be found in the HMP16SData vignette. To read more about the HMP project, visit https://www.hmpdacc.org/.
Given the popularity of Excel, it is common to find ourselves working with Excel books. Before importing the file, we need to know our current working directory (where in the computer are we working from?), change the directory if necessary, and list the files available for import. For these purposes we will use the built-in os module.
os package provides access to the operating system
In order to access functions and files in your operating system, the os package provides utilities that facilitate navigating through directories and accessing system information. Here's a table with some helpful functions and their descriptions.
| Function | Description |
|---|---|
| getcwd() | Retrieve the current working directory. This is the location where Python thinks you're working from. This is usually the directory holding your Jupyter notebook. |
| chdir() | Change the current working directory. This can be an absolute path (i.e. C:/Users/yourname/) or relative to where you currently are (i.e. data/) |
| listdir() | List the files and folders in the current directory (default) or in the directory specified as a parameter. |
| mkdir() | Make a new directory (aka folder) in the current directory (default) or by providing a relative or absolute path. Note: you will get an error if attempting to make a directory that already exists. |
| rename() | Rename a folder or file. You can even move things to new directories as long as the destination path exists. |
| rmdir() | Remove a directory. Gives an error if the directory does not exist or is not empty. |
| remove() | Remove a file (path) as long as it exists. |
# Import the os package. No need to rename it.
import os
# What is my current working directory?
# Use the print command to make it more readable
print(os.getcwd())
C:\Users\mokca\Dropbox\!CAGEF\Course_Materials\Introduction_to_Python\2022.01_Intro_Python\Lecture_03_Data_Formatting
# Change current working directory if needed
os.chdir("data/")
print(os.getcwd())
C:\Users\mokca\Dropbox\!CAGEF\Course_Materials\Introduction_to_Python\2022.01_Intro_Python\Lecture_03_Data_Formatting\data
# List files in directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_measurements.csv', 'human_microbiome_project_measurements_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_metadata.csv', 'human_microbiome_project_otu_taxa_metadata_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_data_taxa_merged.csv', 'subset_taxa_metadata_merged.csv']
# Create or make a directory
os.mkdir("test_directory")
# Check out how that's changed the directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_measurements.csv', 'human_microbiome_project_measurements_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_metadata.csv', 'human_microbiome_project_otu_taxa_metadata_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_data_taxa_merged.csv', 'subset_taxa_metadata_merged.csv', 'test_directory']
# Rename a directory or file
os.rename(src = "test_directory", dst = "renamed_directory")
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_measurements.csv', 'human_microbiome_project_measurements_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_metadata.csv', 'human_microbiome_project_otu_taxa_metadata_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'renamed_directory', 'subset_data_taxa_merged.csv', 'subset_taxa_metadata_merged.csv']
os.rmdir("renamed_directory/") # Remove a directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_measurements.csv', 'human_microbiome_project_measurements_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_metadata.csv', 'human_microbiome_project_otu_taxa_metadata_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_data_taxa_merged.csv', 'subset_taxa_metadata_merged.csv']
read_excel() to import your data
So we covered an important series of os-based commands that are helpful for moving and renaming our data. Sometimes the data itself might have issues, which is all part of data wrangling. To simplify our process, however, the pandas package provides a function to deal with headers and other issues that can appear when importing .xlsx files, and yes, it runs openpyxl under the hood (see Appendix 1, section 6.0.0).
By default read_excel() will use the first sheet in the workbook, but we can also use the sheet_name parameter with the actual sheet name or its sheet position. The resulting imported data will be stored in a pandas.DataFrame.
# Read the file in from the data folder to a DataFrame with the first sheet as default
pd.read_excel("miscellaneous.xlsx")
# By default, Jupyter Notebooks will only show us a truncated version of a DataFrame
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
# Read a specific sheet in based on sheet name
pd.read_excel("miscellaneous.xlsx", sheet_name="All alphabetised")
| Title | No of mentions | |
|---|---|---|
| 0 | 1974 | NaN |
| 1 | 1977 | NaN |
| 2 | 1984 | NaN |
| 3 | 1984 | NaN |
| 4 | 1984 | NaN |
| ... | ... | ... |
| 1998 | Zazie in the Metro | 2.0 |
| 1999 | Zen and the Art of Motorcycle Maintenance | NaN |
| 2000 | Zen and the Art of Motorcycle Maintenance | 2.0 |
| 2001 | Zorba the Greek | NaN |
| 2002 | Zuleika Dobson | NaN |
2003 rows × 2 columns
ExcelFile() class to get a list of sheet names
Although we can read in sheets by position, unless we know the sheet names a priori it is hard to call on a specific sheet by name. This can be done quite easily with the openpyxl package (see Appendix 1, section 6.0.0), but we can also initialize an ExcelFile object, which has the .sheet_names property.
In fact, over the years, the function ExcelFile.parse() has become virtually identical to the read_excel() function if we were to take a look under the hood.
You can also select multiple specific sheets to read in after you've created an ExcelFile object.
# Generate an ExcelFile object which will contain information about our excel file
microbes_xl = pd.ExcelFile('miscellaneous.xlsx')
# What is this object?
type(microbes_xl)
# List the sheet names
microbes_xl.sheet_names
pandas.io.excel._base.ExcelFile
['microbes', 'Lists', 'All alphabetised', 'Top titles', 'dropoff']
# Take our excel object and parse a specific file
microbes_xl.parse(sheet_name = "microbes") # Take our Excel object and parse a specific sheet
# Close the file when we're done with it. This releases it from memory.
microbes_xl.close()
# Here's equivalent code if you only want one sheet at a time
pd.ExcelFile('miscellaneous.xlsx').parse(sheet_name = "microbes")
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
dictionary of DataFrame objects
Suppose you wanted to select multiple sheets. The read_excel() function will parse through a list [ ] of sheet indices (zero-indexed, of course!) and names, and store them as entries in a dictionary object using the sheet_name arguments as keys. With this parameter you can:
use None to open all sheets in your file
# Read in multiple sheets
microbes_xl_dict = pd.read_excel('miscellaneous.xlsx', sheet_name=[1, 2, "dropoff"])
# What is the object type?
type(microbes_xl_dict)
# Look at the keys
microbes_xl_dict.keys()
dict
dict_keys([1, 2, 'dropoff'])
microbes_xl_dict[2]
| Title | No of mentions | |
|---|---|---|
| 0 | 1974 | NaN |
| 1 | 1977 | NaN |
| 2 | 1984 | NaN |
| 3 | 1984 | NaN |
| 4 | 1984 | NaN |
| ... | ... | ... |
| 1998 | Zazie in the Metro | 2.0 |
| 1999 | Zen and the Art of Motorcycle Maintenance | NaN |
| 2000 | Zen and the Art of Motorcycle Maintenance | 2.0 |
| 2001 | Zorba the Greek | NaN |
| 2002 | Zuleika Dobson | NaN |
2003 rows × 2 columns
# load all sheets from a book
pd.read_excel('miscellaneous.xlsx', sheet_name=None, header=0).keys() # header takes a zero-indexed row number
dict_keys(['microbes', 'Lists', 'All alphabetised', 'Top titles', 'dropoff'])
| While there are many options for opening Excel files, importing directly with pandas is the most convenient. However, other file types will require different methods. |
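For instance, plain-text delimited files are read with `read_csv()` rather than `read_excel()`. A minimal sketch, using in-memory strings in place of files on disk (the column names and values here are invented for the example):

```python
import pandas as pd
from io import StringIO

# Hypothetical file contents; StringIO stands in for a real file path
csv_text = "otu,count\nOTU_1,12\nOTU_2,7\n"
tsv_text = "otu\tcount\nOTU_1\t12\nOTU_2\t7\n"

# read_csv() assumes commas by default; sep="\t" handles tab-separated files
df_csv = pd.read_csv(StringIO(csv_text))
df_tsv = pd.read_csv(StringIO(tsv_text), sep="\t")
```

Both calls produce identical DataFrames; only the `sep` parameter changes to match the delimiter.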
# Import the 5th, 2nd, and 3rd sheets (in that order) from miscellaneous.xlsx
Next, we will import the HMP files we discussed at the top of class and reformat them in such a way that we can join the two files into a single, long-format pandas DataFrame. The image below is a screenshot of the two files that make up the HMP dataset: one file with metadata and OTU counts, and one with the OTU-taxa information. How would you do the joining? What could we use as a key to join the two datasets? (Think of this question as "what do the two datasets have in common?")
Importing and exploring your dataset, and designing a data wrangling strategy, are things you should always do before you start reshaping your dataset.
Let's take a quick minute to explore the files using Excel and to propose a data wrangling strategy, or even start reshaping your data. Avoid scrolling down for now, as the answers are in this notebook. Also, be sure to make copies of the two files and use those for your exploration. (Data inspection is also done in Python, of course, but for the sake of this exercise, use Excel.)
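Without giving away the HMP-specific answer, the general joining pattern looks like pandas' `merge()` on a shared key column. A minimal sketch on hypothetical toy tables (the "otu" identifiers and taxa here are invented, not taken from the real files):

```python
import pandas as pd

# Two toy tables that share an identifier column
measurements = pd.DataFrame({"otu": ["OTU_1", "OTU_2"],
                             "count": [12, 7]})
taxa = pd.DataFrame({"otu": ["OTU_1", "OTU_2"],
                     "genus": ["Pseudomonas", "Moorella"]})

# merge() lines up rows wherever the key values match
merged = measurements.merge(taxa, on="otu")
```

The `on` parameter names the column the two tables have in common, which is exactly the question posed above for the HMP datasets.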